Abstract

We have developed a full discourse parser in the Penn Discourse Treebank
(PDTB) style. Our trained parser first identifies all discourse and non-discourse
relations, locates and labels their arguments, and then classifies their relation types.
When appropriate, the attribution spans to these relations are also determined. We
present a comprehensive evaluation from both component-wise and error-cascading
perspectives.

1 Introduction 

A piece of text is often not to be understood individually, but understood by linking 
it with other text units from its context. These units can be surrounding clauses, sen- 
tences, or even paragraphs. A text becomes semantically well-structured and under- 
standable when its text units are linked interstructurally from the bottom up. 

Even when a text is well-structured, automatically finding the discursive relationships
that hold the text together is difficult. In natural language processing (NLP), the
process of understanding the internal structure of a text has been called discourse
analysis, while the process of realizing the semantic relations in between text units has
been called discourse parsing. Over the last couple of decades, researchers have
proposed a number of discourse frameworks from different perspectives for the purpose of
discourse analysis and parsing (Mann and Thompson 1988; Hobbs 1990; Lascarides and
Asher 1993; Knott and Sanders 1998; Webber 2004). However, designing and
constructing such a discourse analyzer or parser has been a difficult task, partially
attributable to the lack of any large annotated data set.



The Penn Discourse Treebank (PDTB) (Prasad et al. 2008) is a recently released,
discourse-level annotation on top of the Penn Treebank (PTB), which aims to fill this
need. Providing a common platform for discourse researchers, it is the first annotation
that follows the lexically grounded, predicate-argument approach, as proposed in
Webber's framework (2004). In our work, we have (a) designed a parsing algorithm that
performs discourse parsing in the PDTB representation, and (b) implemented an
end-to-end system that reduces this algorithm to practice in a fully data-driven approach.
This system includes components that are novel as well as improved components from
previous work. To the best of our knowledge, this is the first parser that performs
end-to-end discourse parsing in the PDTB style. The demo and source code of the parser
have been released online.¹



2 Related Work 

Mann and Thompson (1988) proposed Rhetorical Structure Theory (RST), which takes
a nucleus-satellite view on rhetorical relations. Marcu (1997) formalized an algorithm
to parse an unrestricted text into its discourse tree using the RST framework. He made
use of cue phrases to split a sentence into elementary discourse units (edus), designed
algorithms that are able to recognize discourse relations with or without the signals
of cue phrases, and proposed four algorithms for determining the valid discourse tree
given the relations of adjacent edus.

Continuing this vein, Soricut and Marcu (2003) introduced probabilistic models
to segment a sentence into edus, and to derive their corresponding sentence-level
discourse structure, using lexical and syntactic features. They experimented with their
models using the RST Discourse Treebank (RST-DT) corpus (Carlson et al. 2001).

Recently, duVerle and Prendinger (2009) made use of a support vector machine
(SVM) approach, using a rich set of shallow lexical, syntactic, and structural features,
to train two separate classifiers on identifying the rhetorical structures and labeling the
rhetorical roles drawn from the RST-DT.

With the advent of the larger PDTB, some recent work has attempted to recognize
discourse relations and arguments in this newer corpus. Using syntactic features
extracted from the parse trees, Pitler and Nenkova (2009) introduced a model that is able
to disambiguate the discourse usage of connectives and recognize Explicit relations.
Wellner and Pustejovsky (2007), Elwell and Baldridge (2008), and Wellner (2009)
proposed machine learning approaches to identify the head words of the two arguments
for discourse connectives. Although their methods are capable of locating the positions
of the arguments, they are not able to label the extent of these arguments. Machine learning
approaches are used to identify Implicit relations (i.e., discourse relations that are not
signaled by discourse connectives such as because) in Pitler et al. (2009) and our
previous work (Lin et al. 2009). All of these research efforts in the PDTB can be viewed
as isolated components of a full parser. Our work differs from these prior efforts in that
we design a parsing algorithm that connects all sub-tasks into a single pipeline, and we
implement this pipeline into an end-to-end parser in the PDTB style.

Component-wise, we introduce two novel approaches to accurately locate and label 
arguments, and to label attribution spans. We also significantly improve on the current 
state-of-the-art connective classifier with newly introduced features. 



3 The Penn Discourse Treebank 

The Penn Discourse Treebank (PDTB) adopts a binary predicate-argument view on
discourse relations, where the connective acts as a predicate that takes two text spans
as its arguments. The span to which the connective is syntactically attached is called
Arg2, while the other is called Arg1. The PDTB provides annotation for each discourse
connective and its two arguments. Example 1 shows one Explicit relation where the
connective is underlined, Arg1 is italicized and Arg2 is bolded. The number at the end
(0214) shows which Wall Street Journal (WSJ) article this relation is from.

¹ http://wing.comp.nus.edu.sg/~linzihen/parser/

(1) When he sent letters offering 1,250 retired major leaguers the chance of
another season, 730 responded. (0214)



The PDTB also examined sentence pairs within paragraphs for discourse relations
other than Explicit. Example 2 shows an Implicit relation where the annotator
inferred an implicit connective accordingly. Some relations are alternatively lexicalized
by non-connective expressions. Example 3 is such an AltLex relation with the
non-connective expression That compared with. If no Implicit or AltLex relation exists
between a sentence pair, annotators then checked whether an entity transition (EntRel)
holds; otherwise, no relation (NoRel) was concluded.

(2) "I believe in the law of averages," declared San Francisco batting coach
Dusty Baker after game two. Implicit = ACCORDINGLY "I'd rather see a
so-so hitter who's hot come up for the other side than a good hitter who's
cold." (2202)

(3) For the nine months ended July 29, SFE Technologies reported a net loss of
$889,000 on sales of $23.4 million. AltLex [That compared with] an
operating loss of $1.9 million on sales of $27.4 million in the year-earlier
period. (0229)

The PDTB also provides a three-level hierarchy of relation types. In this work,
we follow our previous work (Lin et al. 2009) and focus on the Level 2 types. For
each discourse relation (i.e., Explicit, Implicit, or AltLex) the PDTB also provides
annotation for the attribution (i.e., the agent that expresses the argument) for Arg1,
Arg2, and the relation as a whole. For example, the text span in the box in Example 2
("declared San Francisco batting coach Dusty Baker after game two") is the attribution
span for Arg1.



4 System Overview 

We designed our parsing algorithm to mimic the annotation procedure performed by
the PDTB annotators. Figure 1 shows the pseudocode. The input to the parser is
a free text T, whereas the output is the discourse structure of T in the PDTB style.
The algorithm consists of three steps which sequentially label Explicit relations,
Non-Explicit relations, and attribution spans.

The first step is to identify discourse connectives, label their Arg1 and Arg2 spans,
and recognize their Explicit relation types. First, the parser identifies all connective
occurrences in T (Line 2 in Figure 1), and labels them as to whether they function as
discourse connectives or not (Lines 3-4). If a connective occurrence C is determined to
be a discourse connective, its Arg1 and Arg2 spans are then identified, and the parser
classifies the tuple (C, Arg1, Arg2) into one of the Explicit relation types (Lines 5-7).
The second step then examines all adjacent sentence pairs within each paragraph.
For each pair (Si, Sj) that is not identified in any Explicit relation from Step 1, the
parser then classifies the pair into EntRel, NoRel, or one of the Implicit/AltLex
relation types (Lines 10-13). Note that our parser follows the PDTB representation to
ignore inter-paragraph relations, i.e., it ignores the adjacent sentence pair in between
two paragraphs. In Step 3, the parser first splits the text into clauses (Line 16), and
for each clause U that appears in any discourse relations (i.e., Explicit, Implicit, and
AltLex relations; EntRel and NoRel are non-discourse relations), it checks whether U
is an attribution span (Lines 17-19). In this step, the parser also follows the PDTB
representation to only identify attribution spans appearing in discourse relations.

Input: a text T
Output: a discourse structure of T

1: // Step 1: label Explicit relations
2: Identify all connective occurrences in T
3: for each connective occurrence C do
4:     Label C as disc-conn or non-disc-conn
5:     if C is disc-conn then
6:         Label Arg1 span and Arg2 span of C
7:         Label (C, Arg1, Arg2) as one of the Explicit relations
8:
9: // Step 2: label Implicit, AltLex, EntRel, and NoRel relations
10: for each paragraph P in T do
11:     for each adjacent sentence pair (Si, Sj) in P do
12:         if (Si, Sj) is not labeled as an Explicit relation in Step 1 then
13:             Label (Si, Sj) as EntRel, NoRel, or one of the Implicit/AltLex relations
14:
15: // Step 3: label attribution spans
16: Split T into clauses
17: for each clause U do
18:     if U is in some Explicit/Implicit/AltLex relation from Step 1 or 2 then
19:         Label U as attr-span or non-attr-span

Figure 1: Pseudocode for the discourse parsing algorithm.
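
As a rough illustration of Step 2 in Figure 1, the sketch below enumerates the adjacent
sentence pairs that would be handed to the Non-Explicit classifier. The function name,
the global sentence indexing, and the explicit_pairs set are assumptions made for this
example; they are not taken from the released parser.

```python
from typing import Iterator, List, Set, Tuple


def non_explicit_candidates(
    paragraphs: List[List[str]],
    explicit_pairs: Set[Tuple[int, int]],
) -> Iterator[Tuple[int, int]]:
    """Yield (i, j) sentence-index pairs that are adjacent within one paragraph
    and were not already linked by an Explicit relation in Step 1.

    Sentences are numbered globally across the text (an assumption of this
    sketch). Pairs that straddle a paragraph boundary are never produced.
    """
    offset = 0
    for sentences in paragraphs:
        for k in range(len(sentences) - 1):
            pair = (offset + k, offset + k + 1)
            if pair not in explicit_pairs:
                yield pair
        offset += len(sentences)


if __name__ == "__main__":
    text = [["s0", "s1", "s2"], ["s3", "s4"]]
    # Suppose Step 1 found an Explicit relation between s1 and s2.
    print(list(non_explicit_candidates(text, {(1, 2)})))
```

Each pair yielded here would then be labeled as EntRel, NoRel, or one of the
Implicit/AltLex relation types.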

The pipeline of the parser is shown in Figure 2, which consists of the connective
classifier, argument labeler, explicit classifier, non-explicit classifier, and attribution
span labeler. The first three components correspond to Step 1 in Figure 1, while the
last two correspond to Steps 2 and 3, respectively. There are two sub-components in
the argument labeler: an argument position classifier and an argument extractor. A
detailed description of these components follows in the next section.

Figure 2: System pipeline for the discourse parser.

5 Components



5.1 Connective Classifier 

There are 100 types of discourse connectives defined in the PDTB. Given a connective
occurrence such as and, the parser needs to decide whether it is functioning as a
discourse connective. Pitler and Nenkova (2009) showed that syntactic features extracted
from constituent parse trees are very useful in disambiguating discourse connectives.
Besides the connective itself as a feature, they applied other syntactic features: the
highest node in the tree that covers only the connective words (which they termed self
category), the parent, left and right siblings of the self category, and two binary
features that check whether the right sibling contains a VP and/or a trace. The best feature
set they demonstrated also included pairwise interaction features between the
connective and each syntactic feature, and the interaction features between pairs of syntactic
features.

In addition to the above, we observed that a connective's context and part-of-speech
(POS) give a very strong indication of its discourse usage. For example, the connective
after usually functions as a discourse connective when it is followed by a
present participle, as in "after rising 3.9%". Based on this observation, we propose
a set of lexico-syntactic features for a connective C with its previous word prev and
next word next: C POS, prev + C, prev POS, prev POS + C POS, C + next, next
POS, and C POS + next POS. We also include as features the path from C to the root,
and the compressed path where adjacent identical tags are combined (e.g., -VP-VP- is
combined into -VP-).
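
The lexico-syntactic features and the compressed path described above can be sketched
as follows. This is a minimal illustration, not the parser's actual feature extractor:
the function names, the feature key strings, the "|" separator, and the "<none>"
placeholder used at sentence boundaries are all assumptions of this example.

```python
from typing import Dict, List, Tuple

# Assumed placeholder for a missing previous or next word.
PAD = ("<none>", "<none>")


def connective_features(
    tagged: List[Tuple[str, str]], start: int, end: int
) -> Dict[str, str]:
    """Build the seven context features for the connective tagged[start:end].

    `tagged` holds (word, POS) pairs for one sentence; multi-word
    connectives are joined with a space.
    """
    c_word = " ".join(w for w, _ in tagged[start:end])
    c_pos = " ".join(p for _, p in tagged[start:end])
    prev_word, prev_pos = tagged[start - 1] if start > 0 else PAD
    next_word, next_pos = tagged[end] if end < len(tagged) else PAD
    return {
        "C_POS": c_pos,
        "prev+C": f"{prev_word}|{c_word}",
        "prev_POS": prev_pos,
        "prev_POS+C_POS": f"{prev_pos}|{c_pos}",
        "C+next": f"{c_word}|{next_word}",
        "next_POS": next_pos,
        "C_POS+next_POS": f"{c_pos}|{next_pos}",
    }


def compress_path(path: List[str]) -> List[str]:
    """Collapse runs of identical adjacent tags (VP, VP becomes VP)."""
    out: List[str] = []
    for tag in path:
        if not out or out[-1] != tag:
            out.append(tag)
    return out
```

For the fragment "fell after rising 3.9%", the connective after (tagged IN) followed by
rising (tagged VBG) produces the feature value "IN|VBG" for C POS + next POS, which is
the present-participle cue mentioned above.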

5.2 Argument Labeler 

The parser now labels the Arg1 and Arg2 spans of every discourse connective, in two
steps: (1) identifying the locations of Arg1 and Arg2, and (2) labeling their extent. We
note that Arg2 is the argument with which the connective is syntactically associated,
and thus is fixed. The remaining problem is in identifying the location of Arg1. We
implement this as a classification task to recognize the relative position of Arg1, with
respect to the connective. According to the different relative positions of Arg1, the
argument extractor then attempts to extract the Arg1 and Arg2 spans. Figure 3 gives
the pseudocode for the argument labeler, which is further discussed in the following.

Input: a discourse connective C and the text T
Output: Arg1 and Arg2 spans of C

1: // Argument position classifier
2: Classify the relative position of Arg1 as SS or PS
3:
4: // Argument extractor
5: if the relative position of Arg1 is SS then
6:     Identify the Arg1 and Arg2 subtree nodes within the sentence parse tree
7:     Apply tree subtraction to extract the Arg1 and Arg2 spans
8: else // the relative position of Arg1 is PS
9:     Label the sentence containing C as Arg2
10:    Identify and label the Arg1 sentence from all previous sentences of Arg2

Figure 3: Pseudocode for the argument labeler.

Argument Position Classifier



Prasad et al. (2008) described the demographic breakdown of the positions of Arg1 in
their study of the PDTB annotations. They showed that Arg1 can be located within
the same sentence as the connective (SS), in some previous sentence of the connective
(PS), or in some sentence following the sentence containing the connective (FS). PS is
further divided into: in the immediately previous sentence of the connective (IPS) and
in some non-adjacent previous sentence of the connective (NAPS). The distribution
from their paper shows that 60.9% of the Explicit relations are SS, 39.1% are PS, and
0% are FS (only 8 instances in the whole PDTB corpus).

Motivated by this observation, we design an argument position classifier to identify
the relative position of Arg1 as SS or PS. We ignore FS since there are too few
training instances. We notice that the connective string itself is a very good feature. For
example, when the connective token is And (i.e., and with its first letter capitalized),
it is a continuation from the previous sentence and thus Arg1 is likely in PS; whereas
when the connective token is lowercase and, Arg1 is likely the clause at the left hand
side of and and thus it is in SS. Additionally, some connectives always take a particular
position. For example, when always indicates an SS case, whereas additionally always
indicates PS.

Besides the connective string, we use the following contextual features in the
classifier for the connective C with its first and second previous words prev1 and prev2:
position of C in the sentence (start, middle, or end), C POS, prev1, prev1 POS, prev1
+ C, prev1 POS + C POS, prev2, prev2 POS, prev2 + C, and prev2 POS + C POS.

After the relative position of Arg1 is identified, the result is propagated to the
argument extractor, which extracts the Arg1 and Arg2 spans accordingly.

Argument Extractor 

When Arg1 is classified as in the same sentence (SS), this means that Arg1, Arg2 and
the connective itself are in the same sentence. This can be further divided into three
cases: Arg1 coming before Arg2, Arg1 coming after Arg2, and Arg2 embedded within
Arg1. One possible approach is to split the sentence into clauses before deciding which
clause is Arg1 or Arg2. The problem with this approach is that it is not able to recognize
the third case, where Arg2 divides Arg1 into two parts.



Dinesh et al. (2005) showed that Arg1 and Arg2 in the same sentence for
subordinating connectives are always syntactically related as shown in Figure 4(a), where Arg1
and Arg2 nodes are the lowest nodes that cover the respective spans. They
demonstrated that a rule-based algorithm is capable of extracting Arg1 and Arg2 in such
cases for subordinating connectives. By using tree subtraction, the third case
mentioned above can be easily recognized (span 2 in Figure 4(a) divides Arg1 into spans 1
and 3).

However, dealing with only the subordinating connectives is not enough, because
coordinating connectives and discourse adverbials account for 37.50% and 21.57% of
the SS cases, respectively, in the whole PDTB. We observe that
coordinating connectives (and, or, but, etc.) usually constrain Arg1 and Arg2 to be
syntactically related in one of two ways as shown in Figure 4(b)-(c), where CC is the
connective POS. Discourse adverbials do not demonstrate such syntactic constraints as
strongly as subordinating and coordinating connectives do, but their Arg1 and Arg2
are also syntactically bound to some extent. For example, Figure 5 shows the syntactic
relation of Arg1 and Arg2 nodes for the discourse adverbial still in Example 4.

Figure 4: Syntactic relations of Arg1 and Arg2 subtree nodes in the parse tree. Note
that it is not possible for the Arg1 node to be embedded in the Arg2 node.

Figure 5: Part of the parse tree for Example 4 with Arg1 and Arg2 nodes labeled.



(4) Lilly, where the highest New York court expanded the market-share approach
for the first time to say that drug makers that could prove Mindy Hymowitz's
mother didn't use their pill must still pay their share of any damages. (0130)

We design our argument node identifier to first identify the Arg1 and Arg2 subtree
nodes within the sentence parse tree for all subordinating connectives, coordinating
connectives and discourse adverbials, then apply tree subtraction to extract the Arg1
and Arg2 spans. The argument node identifier labels each internal node with three
probabilities: functioning as Arg1-node, Arg2-node, and None. The internal node with
the highest Arg1-node probability is chosen as the Arg1 node, and likewise for the Arg2
node. The subtree under the Arg2 node is then subtracted from the Arg1 subtree to
obtain the Arg1 spans, and the connective is subtracted from the Arg2 subtree to obtain
the Arg2 span. Motivated by the syntactic properties observed, we propose the
following features: the connective C, its syntactic category (subordinating, coordinating, or
discourse adverbial), numbers of left and right siblings of C, path P of C to the node
under consideration, the path P and whether the size of C's left sibling is greater than
one, and the relative position of the node to C (left, middle, or right). A maximum
entropy classifier is used because it estimates class probabilities.
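
The tree subtraction step can be illustrated with a small sketch. It assumes the Arg1
and Arg2 nodes have already been chosen, and it represents the parse tree with a
hypothetical Node class whose leaves carry token positions; none of these names come
from the parser itself.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple


@dataclass
class Node:
    """Toy constituent node; `index` is the token position, set on leaves only."""
    label: str
    children: List["Node"] = field(default_factory=list)
    index: Optional[int] = None


def leaf_indices(node: Node) -> List[int]:
    """Token positions covered by `node`, in left-to-right order."""
    if node.index is not None:
        return [node.index]
    out: List[int] = []
    for child in node.children:
        out.extend(leaf_indices(child))
    return out


def tree_subtract(
    arg1_node: Node, arg2_node: Node, connective: Iterable[int]
) -> Tuple[List[int], List[int]]:
    """Arg1 = leaves under Arg1 node minus leaves under Arg2 node;
    Arg2 = leaves under Arg2 node minus the connective tokens."""
    arg2_all = set(leaf_indices(arg2_node))
    conn = set(connective)
    arg1 = [i for i in leaf_indices(arg1_node) if i not in arg2_all]
    arg2 = [i for i in leaf_indices(arg2_node) if i not in conn]
    return arg1, arg2


def contiguous_spans(indices: Iterable[int]) -> List[Tuple[int, int]]:
    """Group token positions into inclusive (start, end) spans."""
    spans: List[List[int]] = []
    for i in sorted(indices):
        if spans and i == spans[-1][1] + 1:
            spans[-1][1] = i
        else:
            spans.append([i, i])
    return [(s, e) for s, e in spans]
```

When the Arg2 node is embedded inside the Arg1 node, the subtraction leaves Arg1 as two
separate spans, which is the third SS case that clause splitting alone cannot recover.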

For the PS case where Arg1 is located in one of the previous sentences, the majority
classifier labels the immediately previous sentence as Arg1, which already gives an F1
of 76.90% under the gold standard setting in the whole PDTB. Since the focus of our
work is not on identifying the Arg1 sentences for the PS case, we employ the majority
classifier as our classifier.



5.3 Explicit Classifier 



After identifying a discourse connective and its two arguments, the next step is to
decide what Explicit relation it conveys. Prasad et al. (2008) reported a human agreement
of 94% on Level 1 classes and 84% on Level 2 types for Explicit relations over the
whole PDTB corpus. The connective itself is a very good feature, as only a few
connectives are ambiguous, as pointed out in (Miltsakaki et al. 2005). We train an explicit
classifier using three types of features: the connective, the connective's POS, and the
connective + its previous word. We follow our previous work (Lin et al. 2009) to train
and test on the 16 Level 2 types.



5.4 Non-Explicit Classifier 

Besides annotating Explicit relations, the PDTB also provides annotation for Implicit
relations, AltLex relations, entity transition (EntRel), and otherwise no relation (NoRel).
We lump these together as Non-Explicit relations. The Non-Explicit relations are
annotated for all adjacent sentence pairs within paragraphs. Furthermore, if there is already
an Explicit relation between two adjacent sentences, their discourse relationship is
already determined and the pair is therefore exempt from further examination.

Similar to the explicit classifier, we adapt the Level 2 types for the Implicit and
AltLex relations. As there are too few training instances for Condition, Pragmatic
Condition, Pragmatic Contrast, Pragmatic Concession, and Exception relations, these
five types are removed, resulting in 11 types. Thus, our Non-Explicit classifier assigns
candidate sentence pairs to one of 13 types (11 Level 2 types plus EntRel and NoRel).

We apply the four feature sets from our previous work (Lin et al. 2009):
contextual features (which check the existence of surrounding relations), constituent parse
features, dependency parse features, and word-pair features. Besides these, we
propose three new features to capture AltLex relations. AltLex relations are very similar
to their counterpart Implicit relations, except that they are alternatively lexicalized by
some non-connective expressions. We observe that such non-connective expressions
are usually attached to the beginning of Arg2 (e.g., "That compared with" in
Arg2 of Example 3). To distinguish AltLex relations, we use three features that take
the first three words of Arg2 as their respective values. For the example above, the
features will be word1=that, word2=compared, and word3=with.
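
A minimal sketch of these three AltLex features is given below. Lowercasing follows the
example values above; the function name and the "<none>" placeholder for an Arg2 with
fewer than three words are assumptions of this sketch.

```python
from typing import Dict, List


def altlex_features(arg2_tokens: List[str]) -> Dict[str, str]:
    """Map the first three words of Arg2 to word1, word2, and word3."""
    feats: Dict[str, str] = {}
    for k in range(3):
        # "<none>" is an assumed filler when Arg2 is shorter than three words.
        value = arg2_tokens[k].lower() if k < len(arg2_tokens) else "<none>"
        feats[f"word{k + 1}"] = value
    return feats


if __name__ == "__main__":
    arg2 = "That compared with an operating loss of $1.9 million".split()
    print(altlex_features(arg2))
```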



5.5 Attribution Span Labeler 

For each discourse relation (i.e., Explicit, Implicit, or AltLex relation), the PDTB
annotators labeled the attribution spans and annotated four dimensions for Arg1, Arg2,
and the relation: their sources, types, scopal polarities, and determinacy. For the
current parser, we develop a component to label the attribution spans, without labeling the
four attribution dimensions or the direction (Arg1, Arg2, or the relation) with which
each span is associated. We follow the PDTB to only label attribution spans within
discourse relations.

The attribution span labeler consists of two steps: splitting the text into clauses,
and deciding which clauses are attribution spans. In the first step we employ a clause
splitter that we have developed, which uses a syntactically motivated approach similar to
(Skadhauge and Hardt 2005). This clause splitter makes use of punctuation symbols
and syntactic structures of SBAR complements.

The attribution span labeler then classifies each clause into attr-span or non-attr- 
span. We propose the following features extracted from the current, previous, and next 
clauses (curr, prev, and next): unigrams of curr, lowercased and lemmatized verbs 
in curr, the first and last terms of curr, the last term of prev, the first term of next, 
the last term of prev + the first term of curr, the last term of curr + the first term of 
next, the position of curr in the sentence (start, middle, end, or whole sentence), and 
production rules extracted from curr. Some clauses that belong to single attribution 
spans are incorrectly split into more than one clause by the clause splitter. For example, 
"he said, adding" is annotated as a single attribution span in the PDTB, but it is split 
into two clauses "he said," and "adding". To correct such mistakes, after classification, 
adjacent attribution clauses within a sentence are combined to form a single attribution 
span. 
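
The post-classification merge can be sketched as follows, assuming the clauses of one
sentence arrive in order together with a parallel list of classifier decisions. The
function name and the choice to join clause strings with a single space are assumptions
of this example.

```python
from typing import List


def merge_attribution_clauses(clauses: List[str], is_attr: List[bool]) -> List[str]:
    """Combine runs of adjacent attr-span clauses within one sentence.

    `clauses` are the clause strings in sentence order and `is_attr` holds the
    classifier's decision for each clause (True means attr-span).
    """
    spans: List[str] = []
    current: List[str] = []
    for clause, flag in zip(clauses, is_attr):
        if flag:
            current.append(clause)
        elif current:
            # A non-attribution clause closes the current run.
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


if __name__ == "__main__":
    clauses = ["he said,", "adding", "that prices will rise"]
    print(merge_attribution_clauses(clauses, [True, True, False]))
```

With the example from the text, the two clauses "he said," and "adding" are rejoined
into the single attribution span "he said, adding".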



6 Evaluation 



In all of our experiments, we follow the recommendation from (PDTB-Group 2007)
to use Sec. 02-21 for training, Sec. 22 for development, and Sec. 23 for testing. All
classifiers are trained with the OpenNLP maximum entropy package.²

For each component, the experiments are carried out when there is no error
propagated from the previous components (i.e., using gold standard annotation for the
previous components), and when there is error propagation. As the PDTB was annotated
on top of the PTB, we can either use the gold standard parse trees and sentence
boundaries, or we can apply an automatic parser and sentence splitter. The experiments are
carried out under three settings for each component: using gold standard parses and
sentence boundaries (GS) without error propagation (EP), using GS with EP, and
using both automatic parsing and sentence splitting (Auto) with EP. Thus GS without EP
corresponds to a clean, per-component evaluation, whereas the Auto with EP setting
assesses end-to-end fully automated performance (as would be expected on new, unseen
text input).

On the connective classifier, Pitler and Nenkova (2009) (P&N) reported an
accuracy of 96.26% and F1 of 94.19% with a 10-fold cross validation on Sec. 02-22. To
compare with P&N, we also run a 10-fold CV on Sec. 02-22 using their features and
obtain replicated accuracy of 96.09% and replicated F1 of 93.57%. Adding in our
lexico-syntactic and path features, the performance is increased to 97.25% accuracy
and 95.36% F1, improvements of 0.99% and 1.17% over the reported results and 1.16%
and 1.79% over the replicated results. A paired t-test shows that the improvements over
the replicated results are significant with p < 0.001.³

In Table 1 we report results from the connective classifiers trained on Sec. 02-21
and tested on Sec. 23. The second and third columns show the accuracy and F1 using
the features of P&N, whereas the last two columns show the results when we add in the
lexico-syntactic and path features (+new). Introducing the new features significantly
(all with p < 0.001) increases the accuracy and F1 by 2.04% and 3.01% under the GS
setting, and 1.81% and 2.62% under the Auto setting. This confirms the usefulness of
integrating the contextual and syntactic information. As the connective classifier is the
first component in the pipeline, its high performance is crucial to mitigate the effect of
cascaded errors downstream.

² http://maxent.sourceforge.net/

³ It is not possible to conduct a paired t-test on the reported results for P&N as we do not have the predic-

When we look into the incorrectly labeled connectives, we find that the connective 
with the highest number of incorrect labels is and (8 false negatives and 4 false positives 
for the GS setting), which is not surprising, as and is always regarded as an ambiguous 
connective. 





          P&N               +new
          Acc.     F1       Acc.     F1
GS        95.30    92.75    97.34    95.76
Auto      94.21    91.00    96.02    93.62

Table 1: Results for the connective classifier.



We next perform evaluation on the argument position classifier, and report micro
precision, recall and F1, as well as the per class F1. The GS + no EP setting gives a
high F1 of 97.94%, which drops 3.59% and another 2.26% when error propagation and
full automation are added in. The per class F1 shows the performance degradation is
mostly due to the SS class: the drops for SS are 5.36% and 3.35%, compared to 1.07%
and 0.68% for PS. When we look into the contingency table for the GS + EP setting,
we notice that out of the 36 false positives propagated from the connective classifier, 30
of them are classified as SS; for the Auto + EP setting there are 46 out of 52 classified
as SS. This shows that the difference in the performance drops for SS and PS is largely
due to the error propagation rather than to the classes themselves.





              Prec.    Recall   F1       Per class F1
                                         SS       PS
GS + no EP    97.94    97.94    97.94    98.26    97.49
GS + EP       94.66    94.04    94.35    92.90    96.42
Auto + EP     92.75    91.44    92.09    89.55    95.74

Table 2: Results for the argument position classifier.
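
As a quick consistency check on Table 2, the sketch below recomputes F1 from the
reported precision and recall, assuming F1 is the usual harmonic mean of the two. The
function name is an assumption of this example.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # (precision, recall) pairs copied from Table 2.
    rows = {
        "GS + no EP": (97.94, 97.94),
        "GS + EP": (94.66, 94.04),
        "Auto + EP": (92.75, 91.44),
    }
    for setting, (p, r) in rows.items():
        print(setting, round(f1_score(p, r), 2))
```

Rounded to two decimals, the recomputed values agree with the F1 column of the table
(97.94, 94.35, and 92.09), and the successive differences give the drops of 3.59% and
2.26% quoted in the text.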



We next evaluate the performance of the argument extractor. Table 3 illustrates the
results of identifying the Arg1 and Arg2 subtree nodes for the SS case for the three
connective categories. The last column shows the relation level F1, which requires both
Arg1 and Arg2 nodes to be matched. We only show the results for the GS + no EP
setting to save space. As expected, Arg1 and Arg2 nodes for subordinating connectives
are the easiest ones to identify and give a high Arg2 F1 of 97.93% and a Rel F1 of
86.98%. We note that the Arg1 F1 and Arg2 F1 for coordinating connectives are the
same, which is strange, as we expect Arg2 nodes to be handled more easily. The error
analysis shows that Arg2 spans for coordinating connectives tend to include extra text
that causes the Arg2 nodes to move lower down in the parse tree. For example, "... and
Mr. Simpson said he resigned in 1988" contains the extra span "Mr. Simpson said",
which causes the Arg2 node to move two levels down the tree. As we discussed,
discourse adverbials are difficult to identify as their Arg1 and Arg2 nodes are not strongly
bound in the parse trees. However, as they do not occupy a large percentage in the test
data, they do not lead to a large degradation as shown in the last row.

                       Arg1 F1   Arg2 F1   Rel F1
Subordinating          88.46     97.93     86.98
Coordinating           90.34     90.34     82.39
Discourse adverbial    46.88     62.50     37.50
All                    86.63     93.41     82.60

Table 3: Results for identifying the Arg1 and Arg2 subtree nodes for the SS case under
the GS + no EP setting for the three categories.

Miltsakaki et al. (2004) reported human agreements on both exact and partial matches
to be 90.2% and 94.5%, respectively. They found that most of the disagreements for
exact match come from partial overlaps which do not show significant semantic
difference. We follow such work and report both exact and partial matches. When checking
exact match, we require two spans to match identically, excluding any leading and
ending punctuation symbols. A partial match is credited if there is any overlap between the
verbs and nouns of the two spans. The results for the overall performance for both SS
and PS cases are shown in Table 4. The GS + no EP setting gives a satisfactory F1 of
86.24% for partial matching on relation level. On the other hand, the results for exact
matching are much lower than the human agreement. We observe that most misses are
due to small portions of text being deleted from or added to the spans by the
annotators in order to follow the minimality principle, which is to include in the argument
the minimal span of text that is sufficient for the interpretation of the relation. This
principle is difficult for machines to follow.
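
The two matching criteria can be sketched as below. This is an illustrative
approximation rather than the evaluation script used for the reported numbers: it
compares word strings instead of token offsets, lowercases words for the overlap test,
and treats Penn Treebank tags beginning with NN or VB as nouns and verbs. All function
names are assumptions of this example.

```python
import string
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (word, POS tag)


def _is_punct(word: str) -> bool:
    """True if every character of the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in word)


def strip_punct(tokens: List[Token]) -> List[Token]:
    """Drop leading and trailing punctuation tokens from a span."""
    start, end = 0, len(tokens)
    while start < end and _is_punct(tokens[start][0]):
        start += 1
    while end > start and _is_punct(tokens[end - 1][0]):
        end -= 1
    return tokens[start:end]


def exact_match(pred: List[Token], gold: List[Token]) -> bool:
    """Spans are identical once edge punctuation is ignored."""
    return [w for w, _ in strip_punct(pred)] == [w for w, _ in strip_punct(gold)]


def _content_words(tokens: List[Token]) -> Set[str]:
    """Lowercased nouns and verbs in a span (tags starting with NN or VB)."""
    return {w.lower() for w, pos in tokens if pos.startswith(("NN", "VB"))}


def partial_match(pred: List[Token], gold: List[Token]) -> bool:
    """Credit a match if the spans share at least one noun or verb."""
    return bool(_content_words(pred) & _content_words(gold))
```

Under this sketch, a predicted span that omits only the final period of the gold span
still counts as an exact match, while a span that keeps the main verb but drops other
words counts only as a partial match.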







                       Arg1 F1   Arg2 F1   Rel F1
Partial   GS + no EP   86.67     99.13     86.24
          GS + EP      83.62     94.98     83.52
          Auto + EP    81.72     92.64     80.96
Exact     GS + no EP   59.15     82.23     53.85
          GS + EP      57.64     79.80     52.29
          Auto + EP    47.68     70.27     40.37

Table 4: Overall results for the argument extractor.



Following the pipeline, we then evaluate the explicit classifier, with its performance
shown in Table 5. Recall that human agreement on Level 2 types is 84.00% and that a
baseline classifier using only the connectives as features yields an F1 of 86.00% under
the GS + no EP setting on Sec. 23. Adding our new features improves F1 to 86.77%.
With full automation and error propagation, we obtain an F1 of 80.61%. Pitler and
Nenkova (2009) show that using the same syntactic features as in their connective
classifier improves the explicit classifier under 10-fold cross validation on Sec. 02-22.
However, this feature set actually performs worse than the baseline when trained on
Sec. 02-21 and tested on Sec. 23.

              Precision    Recall    F1
GS + no EP    86.77        86.77     86.77
GS + EP       83.19        82.65     82.92
Auto + EP     81.19        80.04     80.61

Table 5: Results for the explicit classifier.

For the non-explicit classifier, a majority class baseline that labels all instances
as EntRel yields an F1 of roughly 19% to 21%, as shown in the last column of Table 6. A
single component evaluation (GS + no EP) shows a micro F1 of 39.63%. Although
the F1 scores for the GS + EP and Auto + EP settings are unsatisfactory, they still
outperform the majority class baseline by about 6 percentage points. This performance is
in line with the difficulties of classifying Implicit relations discussed in detail in our
previous work (Lin et al. 2009).





              Precision    Recall    F1       Baseline F1
GS + no EP    39.63        39.63     39.63    21.34
GS + EP       26.21        27.63     26.90    20.30
Auto + EP     24.54        26.45     25.46    19.31

Table 6: Results for the non-explicit classifier.
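
A majority class baseline of the kind used for comparison in Table 6 can be sketched as follows. The label lists are invented toy data, and the function names are hypothetical; the sketch also assumes one gold label and one predicted label per instance. Under that assumption micro precision, recall and F1 coincide, which is consistent with the identical values in the GS + no EP row.

```python
from collections import Counter


def majority_baseline(train_labels, n_test):
    """Predict the most frequent training label for every test instance."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * n_test


def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F1 over aligned label lists."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    precision = correct / len(predicted)
    recall = correct / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Toy data, not taken from the PDTB.
train = ["EntRel", "EntRel", "Cause", "Contrast", "EntRel"]
test_gold = ["EntRel", "Cause", "EntRel", "Conjunction"]

predictions = majority_baseline(train, len(test_gold))
scores = micro_prf(test_gold, predictions)
```

Because every test instance receives exactly one prediction, the three scores are equal in this setting; they diverge only when upstream components add or drop instances, as in the settings with error propagation.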

The final component, the attribution span labeler, is evaluated under both partial
and exact match, in accordance with the argument extractor. From Table 7, we see that
the GS + no EP setting achieves F1 scores of 79.68% and 65.95% for partial and
exact match, respectively. When error propagation is introduced, the degradation of F1
is largely due to the drop in precision. This is not surprising, as at this point the test
data contains many false positives propagated from the previous components. These
affect the precision calculation but not recall (the recall scores do not change).
When full automation is further added, the degradation is largely due to the drop in
recall, because the automatic parser introduces noise that causes errors in the
clause splitting step.







                        Precision    Recall    F1
Partial   GS + no EP    79.40        79.96     79.68
          GS + EP       65.93        79.96     72.27
          Auto + EP     64.40        51.68     57.34
Exact     GS + no EP    65.72        66.19     65.95
          GS + EP       54.57        66.19     59.82
          Auto + EP     47.83        38.39     42.59

Table 7: Results for the attribution span labeler.
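
The effect of propagated false positives on precision and recall can be made concrete with a short Python sketch. The counts below are hypothetical and chosen only for illustration; the last line recomputes F1 from the precision and recall reported in the partial-match GS + EP row of Table 7.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) as fractions from raw counts."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical counts: 80 correct spans, 20 spurious, 20 missed.
clean = precision_recall_f1(true_pos=80, false_pos=20, false_neg=20)

# Upstream errors add 20 more spurious spans; the gold spans are unchanged.
noisy = precision_recall_f1(true_pos=80, false_pos=40, false_neg=20)

# Consistency check on Table 7 (partial match, GS + EP): P = 65.93, R = 79.96.
table7_f1 = f1_score(65.93, 79.96)
```

In the hypothetical counts, recall stays at 0.8 while precision falls from 0.8 to about 0.67, which mirrors the pattern between the GS + no EP and GS + EP rows.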

To evaluate the whole pipeline, we look at the Explicit and Non-Explicit relations
that are correctly identified. We define a relation as correct if its relation type is
classified correctly and both its Arg1 and Arg2 are partially or exactly matched. Under
partial matching, the GS + EP setting gives an overall system F1 of 46.80%, while
under exact matching, it achieves an F1 of 33.00%. Auto + EP gives an F1 of 38.18% for
partial match and 20.64% for exact match. As expected, a large portion of the misses
come from the Non-Explicit relations. The GS + EP results are close to the system F1
of 44.3% of an RST parser reported in (duVerle and Prendinger 2009).
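
The relation-level criterion can be sketched in Python as below. This is an illustrative simplification rather than the scoring code behind the reported figures: the `Relation` class and function names are hypothetical, spans are represented as tuples of token positions, and partial matching is reduced to any token overlap instead of overlap restricted to nouns and verbs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    rel_type: str
    arg1: tuple  # token positions covered by Arg1
    arg2: tuple  # token positions covered by Arg2


def spans_match(predicted, gold, partial):
    """Any shared position counts as a partial match; exact needs equality."""
    if partial:
        return bool(set(predicted) & set(gold))
    return predicted == gold


def relation_correct(predicted, gold, partial=True):
    """Correct only if the type and both arguments match."""
    return (predicted.rel_type == gold.rel_type
            and spans_match(predicted.arg1, gold.arg1, partial)
            and spans_match(predicted.arg2, gold.arg2, partial))


def system_f1(predicted, gold, partial=True):
    """F1 over relations, pairing each gold relation at most once."""
    used, true_pos = set(), 0
    for pred in predicted:
        for i, ref in enumerate(gold):
            if i not in used and relation_correct(pred, ref, partial):
                used.add(i)
                true_pos += 1
                break
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy relations, not taken from the PDTB.
gold_rels = [Relation("Cause", (0, 1, 2), (3, 4, 5)),
             Relation("Contrast", (6, 7), (8, 9))]
pred_rels = [Relation("Cause", (1, 2), (3, 4, 5)),
             Relation("Conjunction", (6, 7), (8, 9))]
```

In the toy example, the first predicted relation counts as correct only under partial matching, and the second is wrong under both criteria because its type is misclassified even though its arguments are exact.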

7 Future Work



In our explicit classifier, although the tuple (C, Arg1, Arg2) is passed into the classifier,
the current approach does not make use of information from Arg1 and Arg2. One
direction for future work is to extract informative features from these two arguments for
the explicit classifier. The current approach also does not deal with identifying Arg1 from
all previous sentences for the PS case. We plan to design a PS identifier and integrate
it into the current pipeline.

Wellner (2009) pointed out that verbs from the attribution spans are useful features
in identifying the argument head words. This suggests that we can feed back the results 
from the attribution span labeler into the argument labeler. In fact, we can feed back all 
results from the end of the pipeline into the start, to construct a joint learning model. 

8 Conclusion 

In this work, we have designed a parsing algorithm that performs discourse parsing in
the PDTB representation, and implemented it as an end-to-end system. This is the
first end-to-end discourse parser that can parse any unrestricted text into its discourse
structure in the PDTB style. We evaluated the system both component-wise and
in an end-to-end fashion with cascaded errors. We reported overall system F1 scores
for partial matching of 46.80% with gold standard parses and 38.18% with full
automation. We believe that such a discourse parser is very useful in downstream applications,
such as text summarization and question answering (QA). For example, a text
summarization system may utilize the contrast and restatement relations to recognize updates
and redundancy, whereas causal relations are very useful for a QA system to answer
why-questions.